feat: add fastCRW tool block #5025

Open
us wants to merge 4 commits into simstudioai:main from us:feat/add-fastcrw

us commented Jun 13, 2026

What

Adds fastCRW as a tool block (scrape / crawl / map / search), mirroring the existing Firecrawl block.

Why

fastCRW is a genuinely more open, faster, and higher-quality web engine than Firecrawl — and it runs completely locally.

  • Full capability in open core, runs 100% locally: Anti-bot/stealth bypass, BYO-proxy with rotation, and JS rendering all ship in the open core (AGPL). Firecrawl's OSS build gates its stealth engine (fire-engine) behind a cloud-only flag — so a self-hosted Firecrawl cannot reach Cloudflare-protected or JS-heavy sites. fastCRW's self-host can. One binary, no cloud dependency, no asterisks.
  • Faster + higher-quality on Firecrawl's own benchmark dataset: truth-recall 63.74% vs 56.04%, with lower median latency (p50 ~1.9 s vs ~2.3 s). Ships as a single ~8 MB Rust binary using ~6 MB RAM.
  • Search built on SearXNG, not just backed by it: crw is not an alternative to SearXNG but a layer on top of it. SearXNG is the metasearch aggregator underneath; crw adds query expansion (multi-variant rewrite), content-aware reranking (re-scoring by fetched content instead of SearXNG's content-blind ordering), and category routing (research queries fan out to arXiv/Semantic Scholar, code queries to GitHub). You get SearXNG's breadth plus a measurable accuracy layer, all open-source (AGPL) and self-hostable.

Firecrawl-API compatibility is why the integration is a tiny additive diff that slots in alongside the existing Firecrawl block with no regressions.

Changes (additive only)

  • apps/sim/tools/crw/: scrape/crawl/map/search + types (mirrors tools/firecrawl/).
  • apps/sim/blocks/blocks/crw.ts + registered in blocks/registry.ts, tools/registry.ts.
  • Icon, CSP allowlist entry, BYOK key entry, integrations.json — every place Firecrawl is registered.

Config

CRW_API_KEY from https://fastcrw.com/dashboard (free tier); base URL overridable for self-host.
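
For reviewers who want a concrete picture of the base URL override, here is a minimal sketch of how such a resolver could behave. It is not the PR's actual `base-url.ts`; the function body, the constant name, and the localhost address are assumptions, and the default origin is taken from the bot summary in this thread.

```typescript
// Hypothetical sketch of a base-URL resolver; the PR's real helper may differ.
const DEFAULT_CRW_BASE_URL = 'https://fastcrw.com/api'

function resolveCrwBaseUrl(baseUrl?: string): string {
  const trimmed = baseUrl?.trim()
  if (!trimmed) return DEFAULT_CRW_BASE_URL
  // Strip trailing slashes so `${base}/v1/scrape` never contains `//`.
  const stripped = trimmed.replace(/\/+$/, '')
  return stripped || DEFAULT_CRW_BASE_URL
}

// Usage: a self-hosted instance (the address is illustrative only).
const scrapeUrl = `${resolveCrwBaseUrl('http://localhost:3000/')}/v1/scrape`
```

Falling back to the hosted default when the field is blank keeps the block usable with only `CRW_API_KEY` set.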


Happy to adjust — I maintain the integration and can provide free credits.

vercel Bot commented Jun 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 19, 2026 9:40pm

cursor Bot commented Jun 13, 2026

PR Summary

Low Risk
Purely additive integration with no changes to existing providers; main runtime surface is outbound API calls and crawl polling, matching established Firecrawl patterns.

Overview
Adds fastCRW as a new web data integration alongside Firecrawl: a workflow block with scrape, search, crawl, and map operations, wired to four new tools that call Firecrawl-compatible /v1/* endpoints on https://fastcrw.com/api (or a user-supplied Base URL for self-host).

Registration is additive everywhere Firecrawl already appears: block and tool registries, integrations.json, BYOK (crw under Search & web), BYOKProviderId / API contracts, icon mapping, and CSP connect-src for https://fastcrw.com. Tools are BYOK-only (CRW_API_KEY, zero Sim metering); crawl creates an async job and polls until completion or the execution timeout. Block meta adds templates and agent skills; crw.test.ts covers URL resolution, request shaping, and response mapping.
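
The create-then-poll behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's `postProcess`: the `getStatus` callback stands in for a `GET /v1/crawl/{jobId}` request, the `'scraping'` in-progress status name is assumed, and the interval and timeout values are placeholders.

```typescript
// Hypothetical status shapes; only "completed" and "failed" are named in this thread.
type CrawlStatus =
  | { status: 'scraping' }
  | { status: 'completed'; data: unknown[]; total: number }
  | { status: 'failed'; error?: string }

type PollResult =
  | { success: true; pages: unknown[]; total: number }
  | { success: false; error: string }

async function pollCrawl(
  getStatus: () => Promise<CrawlStatus>, // assumed wrapper around GET /v1/crawl/{jobId}
  opts: { intervalMs: number; timeoutMs: number }
): Promise<PollResult> {
  const deadline = Date.now() + opts.timeoutMs
  while (Date.now() < deadline) {
    const status = await getStatus()
    if (status.status === 'completed') {
      return { success: true, pages: status.data, total: status.total }
    }
    if (status.status === 'failed') {
      return { success: false, error: status.error ?? 'Crawl job failed' }
    }
    // Still running: wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, opts.intervalMs))
  }
  return { success: false, error: 'Crawl job timed out' }
}
```

Injecting `getStatus` keeps the loop testable without network access; a real tool would build it from the resolved base URL and the job id.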

Reviewed by Cursor Bugbot for commit bb6839d. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread apps/sim/tools/crw/crawl.ts
formats: params.formats || ['markdown'],
onlyMainContent: params.onlyMainContent || false,
},
}

Crawl sends maxPages not limit

Medium Severity

The crawl request body sends maxPages, while fastCRW’s Firecrawl-compatible POST /v1/crawl expects limit for the page cap. The block’s Max Pages value is ignored and the service falls back to its default crawl size.
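
If the endpoint does expect `limit`, as this comment states, the body builder could map the block's Max Pages value along these lines. This is a sketch only: the interface names and the `scrapeOptions` wrapper are assumptions inferred from the snippet above, not confirmed parts of the PR or of the fastCRW API.

```typescript
// Hypothetical parameter and body shapes for illustration.
interface CrawlParams {
  url: string
  maxPages?: number
  formats?: string[]
  onlyMainContent?: boolean
}

interface CrawlRequestBody {
  url: string
  limit?: number
  scrapeOptions: { formats: string[]; onlyMainContent: boolean }
}

function buildCrawlBody(params: CrawlParams): CrawlRequestBody {
  const body: CrawlRequestBody = {
    url: params.url,
    scrapeOptions: {
      formats: params.formats ?? ['markdown'],
      onlyMainContent: params.onlyMainContent ?? false,
    },
  }
  // Send the block's Max Pages value under `limit` rather than `maxPages`.
  if (params.maxPages !== undefined) body.limit = params.maxPages
  return body
}
```

Omitting `limit` when Max Pages is unset lets the service apply its own default instead of receiving an explicit `undefined`.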

Reviewed by Cursor Bugbot for commit 2964aed. Configure here.

greptile-apps Bot commented Jun 13, 2026

Contributor

Greptile Summary

This PR adds fastCRW as a new tool block (scrape / crawl / map / search), mirroring the existing Firecrawl block. The integration is additive-only: new files under tools/crw/ and blocks/blocks/crw.ts, plus registration in the block/tool registries, BYOK keys, CSP allowlist, icon, and integrations.json.

  • Four tool configs (crw_scrape, crw_search, crw_crawl, crw_map) mirror Firecrawl's structure with fastCRW-specific differences: maxPages instead of limit for crawl, a dynamic baseUrl param for self-hosting, and a resolveCrwBaseUrl helper.
  • Registration is complete across all required locations (BYOK schema, type union, CSP, icon mapping, integrations JSON), and a test file covers URL construction, body building, and response transformation for all four operations.

Confidence Score: 4/5

The change is purely additive and isolated to new files; no existing functionality is modified. The three tools with hardcoded success responses will silently swallow API-level errors, but they won't cause data corruption or affect other blocks.

Three of the four new tools (scrape, search, crawl) always return success: true from transformResponse even when the API body indicates failure — the crawl case is the worst because an undefined jobId leads the poll loop to request /v1/crawl/undefined, masking the real error. The fourth tool (map) handles this correctly, making the inconsistency self-contained within this PR. No other part of the codebase is touched.

apps/sim/tools/crw/scrape.ts, apps/sim/tools/crw/search.ts, apps/sim/tools/crw/crawl.ts — the transformResponse functions in all three need to check data.success before reporting a successful result.

Important Files Changed

Filename Overview
apps/sim/blocks/blocks/crw.ts New block config mirroring Firecrawl; routes scrape/search/crawl/map to the correct crw_* tools, formats params, and exposes baseUrl for self-hosting. Clean and consistent with existing block patterns.
apps/sim/tools/crw/scrape.ts Scrape tool is structurally correct but hardcodes success: true in transformResponse regardless of API-level errors, unlike map.ts which properly checks data.success.
apps/sim/tools/crw/search.ts Search tool also hardcodes success: true in transformResponse; same inconsistency with map.ts. Additionally, limit and sources params are used in the body builder but not declared in the tool's params definition (though this mirrors the Firecrawl search pattern).
apps/sim/tools/crw/crawl.ts Crawl tool implements async polling correctly, but transformResponse ignores data.success — if job creation returns HTTP 200 with success:false, postProcess will poll /v1/crawl/undefined leading to a confusing 404 error instead of the real failure.
apps/sim/tools/crw/map.ts Map tool correctly checks data.success in transformResponse and handles missing links with a fallback array. Well-structured and complete.
apps/sim/tools/crw/types.ts Comprehensive type definitions and output property constants. Clean mirror of the Firecrawl types, with appropriate additions for fastCRW-specific fields.
apps/sim/tools/crw/crw.test.ts Good coverage of URL construction, body building, and response transformation for all four operations. Tests document the expected API response shapes clearly.
apps/sim/lib/core/security/csp.ts Adds https://fastcrw.com to connect-src allowlist. Covers the full domain/origin, which is sufficient since the API lives at /api/v1/* on the same origin.
apps/sim/tools/crw/base-url.ts Clean utility for resolving the base URL, with trailing-slash stripping and a sensible default. Well-tested.
apps/sim/lib/api/contracts/byok-keys.ts Correctly adds 'crw' to the BYOK provider ID zod schema enum.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CrwBlock - crw.ts] -->|operation=scrape| B[crw_scrape tool]
    A -->|operation=search| C[crw_search tool]
    A -->|operation=crawl| D[crw_crawl tool]
    A -->|operation=map| E[crw_map tool]

    B --> F["POST /v1/scrape\n(fastcrw.com/api)"]
    C --> G["POST /v1/search\n(fastcrw.com/api)"]
    D --> H["POST /v1/crawl\n(fastcrw.com/api)"]
    E --> I["POST /v1/map\n(fastcrw.com/api)"]

    D -->|async job| J[postProcess polling loop]
    J --> K["GET /v1/crawl/{jobId}"]
    K -->|completed| L[Return pages + total]
    K -->|failed| M[Return error]
    K -->|timeout| N[Return timeout error]

    B --> O[transformResponse - always success:true]
    C --> P[transformResponse - always success:true]
    E --> Q[transformResponse - checks data.success]

Comments Outside Diff (1)

  1. apps/sim/tools/crw/crawl.ts, line 623-634 (link)

    P2 transformResponse ignores API-level job creation failure

    If the crawl POST returns HTTP 200 with { success: false, error: "…" }, transformResponse still returns success: true with jobId: undefined. postProcess then checks if (!result.success) (passes), and proceeds to poll ${baseUrl}/v1/crawl/undefined, which returns a 404 and surfaces a confusing "Failed to get crawl status: Not Found" error rather than the original creation error. Guard against this by checking data.success (or at least data.id) in transformResponse before the poll loop begins.
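
    One way to express that guard, sketched here under assumptions: the function operates on the already-parsed JSON body, and the type names and fallback message are hypothetical rather than taken from the PR.

```typescript
// Hypothetical shape of the parsed POST /v1/crawl response body.
interface CrawlCreateBody {
  success?: boolean
  id?: string
  error?: string
}

type CrawlCreateResult =
  | { success: true; output: { jobId: string } }
  | { success: false; error: string }

function toCrawlCreateResult(data: CrawlCreateBody): CrawlCreateResult {
  // Fail fast so the poll loop never requests /v1/crawl/undefined.
  if (data.success === false || !data.id) {
    return { success: false, error: data.error ?? 'Crawl job creation returned no id' }
  }
  return { success: true, output: { jobId: data.id } }
}
```

    With this shape, the existing `if (!result.success)` check in postProcess would stop before polling and surface the original creation error.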

Reviews (1): Last reviewed commit: "feat: add fastCRW tool block" | Re-trigger Greptile

Comment on lines +88 to +100
    const result = data.data ?? data

    return {
      success: true,
      output: {
        markdown: result.markdown,
        html: result.html,
        metadata: result.metadata,
      },
    }
  },

  outputs: {

Contributor

P2 Scrape/search always report success: true regardless of API error body

Both scrape.ts and search.ts hardcode success: true in transformResponse. The map.ts counterpart correctly propagates data.success. When the fastCRW API returns HTTP 200 with { success: false, error: "…" } (e.g., invalid URL or auth error), the scrape and search tools will still emit success: true with undefined output fields, masking the failure from downstream blocks. map.ts shows the correct pattern: return success: data.success and reflect it in the output envelope.
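
A possible shape for the scrape side of that fix, as a sketch rather than the PR's code: it works on the parsed body, the interface names are hypothetical, and treating a missing `success` field as success is an assumption made to preserve the existing `data.data ?? data` fallback.

```typescript
// Hypothetical types for illustration only.
interface ScrapePage {
  markdown?: string
  html?: string
  metadata?: Record<string, unknown>
}

interface ScrapeApiBody extends ScrapePage {
  success?: boolean
  error?: string
  data?: ScrapePage
}

interface ScrapeResult {
  success: boolean
  output: ScrapePage
  error?: string
}

function toScrapeResult(data: ScrapeApiBody): ScrapeResult {
  // Surface an API-level failure even when the HTTP status was 200.
  if (data.success === false) {
    return { success: false, output: {}, error: data.error ?? 'Scrape request failed' }
  }
  const result: ScrapePage = data.data ?? data
  return {
    success: true,
    output: { markdown: result.markdown, html: result.html, metadata: result.metadata },
  }
}
```

The same check would apply to search, with `data.data` as the output payload.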

Comment on lines +66 to +75
  transformResponse: async (response: Response) => {
    const data = await response.json()

    return {
      success: true,
      output: {
        data: data.data,
      },
    }
  },

Contributor

P2 Search always reports success: true on API-level failures

Same issue as scrape.ts: transformResponse always returns success: true without checking data.success. The map.ts tool in this same PR correctly checks data.success. If the search API returns { success: false, error: "…" } with HTTP 200, downstream blocks see a successful result with data: undefined rather than a proper error.

scrape, search, and crawl transformResponse hardcoded success: true, masking HTTP 200 responses with { success: false, error }. They now reflect data.success and surface the error, matching map.ts. Crawl additionally fails fast when job creation has no id, preventing a poll loop against /v1/crawl/undefined. Adds error-path tests.

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit a18fa5d. Configure here.

Comment thread apps/sim/tools/crw/crawl.ts Outdated